-
Notifications
You must be signed in to change notification settings - Fork 0
Ett 1200 facet search error #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works as expected. See the original ticket for a follow-up about the display of this previously-broken facet now that we can select it without an error -- the quotes are missing when the facet is in the "Current Filters". May be out of scope and it's just cosmetic. APPROVE
EDIT: I just put a screen shot on the ticket
aelkiss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main question I have is if we should be doing something more general for escaping queries. In particular it doesn't seem like this will address the issues with characters like ~ and \. Given that these don't cause issues in ls, I think it's worth seeing if we can apply the more general escaping strategy.
bc9f577 to
7f4f26d
Compare
4c3f1d5 to
e8fcf91
Compare
…e removed or escaped to avoid Solr error parser or field injection issue. Adding unit test for all the functions involve on creating the Solr query tests for the functions build_and_or_onephrase, tokenizeInput. Refactoring validateInput function, CleanUp build_and_or_onephrase and Apply escape to create q and fq Solr fields. Removing illegal characters if the query input is invalid, otherwise escaped the input query. Redefined the logic to escape q field considering different escaping rule to each semantic representation. Detecting different kind of unbalanced quotes when the SearchStructure is created. Removing statement on smartquote_problem_talking_catalog test for testing. Fixing playwright test to check unbalance funcy quotes. Integrate the normalization of the fields lcnormalized and stdnum. Document the logic to escape Solr input queries. Handle multiple issues in invalid input queries.
e8fcf91 to
490cacf
Compare
aelkiss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried locally with a variety of malformed things and wasn't able to get any errors from solr. I also verified the quotes in facets works now.
We should probably remove the remaining error_log calls that were added; also see the comments on some of the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure whether it makes sense to test the behavior with e.g. ~2 here, since that isn't syntax we claim to support. It isn't necessarily be a problem to pass it through to Solr, but I'm not sure why so many of the test cases here have it.
| $tokens = $this->solr->tokenizeInput('table AND "chair leg"~2'); | ||
|
|
||
| $this->assertSame( | ||
| ['table AND "chair leg"~2'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this is the behavior we want, although again, it doesn't really matter as we don't claim to support boolean operators in search strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic to tokenize input queries is based on a regular expression. I created some tests to check if the expected behavior. For this specific test, I wanted to confirm that tokenizeInput() captures and does not split phrases containing boolean expressions. The function also keeps expressions such as fuzzy searches like "hello world"~5; and exact phrase - double-quoted string.
| public function testAcceptsValidFieldedQueries(): void | ||
| { | ||
| $valid = [ | ||
| 'title:table', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise here this isn't syntax we claim to support and it won't work as expected.
For example http://localhost:8080/Search/Home?lookfor=author%3Achaucer&searchtype=all returns no results; you need to specifically search by author: http://localhost:8080/Search/Home?lookfor=chaucer&searchtype=author
I don't think we need to reject these kinds of queries outright (and we definitely don't want to get a solr error), but the test is maybe misleading in that it suggests to a reader that these queries would behave as expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a comment to this test to clarify that it checks whether an input query is valid. It's not about the types of queries we support for search.
For example, if the user types a query like author: 'author:smith' http://localhost:8080/Search/Home?lookfor=title%3Atable&searchtype=title
Our search algorithm will understand that the field used for the query is author and that the query string is author:smith. This query is not rejected, but the result is empty because author:smith does not exist in our index.
sys/Solr.php
Outdated
| $args = array_merge($args, $this->spellcheckComponents($ss)); | ||
| } | ||
|
|
||
| error_log("Action before simplesearch: " . $action); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover debugging stuff -- should remove
sys/Solr.php
Outdated
| die(); | ||
| } | ||
|
|
||
| error_log("Solr action used: " . $action); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another extra logging we probably don't want
| * - ' "table" ' → table | ||
| * - '"table name"' → table name | ||
| * - 'table "name"' → table "name" (unchanged) | ||
| * - '"table"name"' → "table"name" (unchanged) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment appears to be at odds with the behavior of the code. I would expect "table"name" to become table"name.
I also recommend adding a test for this method. Here's one:
/* ============================================================
* remove_quotes()
* ============================================================
*/
/**
* @covers Solr::remove_quotes
*/
public function testRemoveQuotes(): void
{
$this->assertSame(
'table',
$this->solr->remove_quotes('"table"')
);
$this->assertSame(
'table',
$this->solr->remove_quotes(' "table" ')
);
$this->assertSame(
'table name',
$this->solr->remove_quotes('"table name"')
);
$this->assertSame(
'table "name"',
$this->solr->remove_quotes('table "name"')
);
$this->assertSame(
'table"name',
$this->solr->remove_quotes('"table"name"')
);
}
Issue: Catalog search fails with a different input query because Solr special characters are not escaped.
The main goal of this task is to fix the search algorithm for preventing query parser errors and injection.
Ticket: ETT-1200
As part of this PR:
The search algorithm in production consist on:
*:*.~, \, table~~2, ~~~///What changes have been implemented on the current PR?
*:*.~, \, table~~2, ~~~///qfield is created.fqfield is createdHow to test:
Next step:
lucene_escapeshould be replaced bylucene_escape_fq.